Abstract: Deep LSTM is an ideal candidate for text recognition.
However, text recognition involves initial image processing steps,
such as segmentation of lines and words, which can introduce errors
into the recognition system. Without segmentation, learning very long
range context is difficult and becomes computationally intractable.
Therefore, alternative soft decisions are needed at the pre-processing
level. This paper proposes a hybrid text recognizer using a deep
recurrent neural network with multiple layers of abstraction and long
range context, along with a language model to verify the output of the
deep neural network. We construct a multi-hypotheses tree architecture
with candidate segments of line sequences from different segmentation
algorithms at its different branches. The deep neural network is
trained on perfectly segmented data and tests each of the candidate
segments, generating Unicode sequences. In the verification step,
these Unicode sequences are validated using a sub-string match with
the language model, and best-first search is used to find the best
possible combination of alternative hypotheses from the tree
structure. Thus the verification framework using language models
eliminates wrong segmentation outputs and filters recognition errors.

I. Introduction

Most Optical Character Recognition (OCR) algorithms assume perfect
segmentation of lines and words, which does not hold in practice. In
Indic scripts, vowel modifiers and conjuncts further aggravate
segmentation errors because these modifiers occupy the upper or lower
zone. This makes the text layout dense and decreases the interline
separation. This paper proposes a text recognition framework to
hypothesize and verify the sequences obtained from multiple
segmentation techniques, using a deep BLSTM network and a language
model to verify the output of the deep neural network. We aim to find
the best possible recognition of word sequences by searching
sub-strings of words derived from multiple segmentation routines. We
construct a hypothesize-and-verify framework in which candidate
segments of word sequences derived from multiple segmentation routines
are at different branches. A deep recurrent neural network is trained
on perfectly segmented data and tests each of the candidate segments,
generating Unicode sequences. This work extends earlier work on
printed text recognition in which a Deep BLSTM architecture for text
recognition was proposed. In the verification stage, these Unicode
sequences are validated using a sub-string match with the language
model, and best-first search is used to find the best possible
combination of alternative hypotheses from the tree structure. The
search region uses a spatial context that considers the preceding and
succeeding words to find the best match. This algorithm is able to
learn the sequence alignment, solving the Unicode re-ordering issues.
The verification framework eliminates insertion and deletion errors of
the recognizer through the sub-string match with the n-grams. This is
a segmentation-free, script-independent framework, and in this paper
we present results on Oriya printed text. The language model is learnt
independently on the script under recognition, and character n-grams
are saved. Oriya script is chosen because no OCR is available for it
and because of the challenges it poses, such as the large number of
classes and the shape complexity of the script.

The paper is organized as follows: Section II gives a brief review of
the work done in this area. Section III presents the multi-hypotheses
recognition architecture, Section IV describes the Deep BLSTM learning
framework in detail, and Section V describes the dataset. The
experimental results are presented in Section VI, followed by the
conclusion in Section VII.

II. Related Work 

Text recognition algorithms have traditionally been segmentation
based: lines are segmented into words and finally into characters,
which are recognized by classifiers. Such approaches have high
segmentation error and do not use context information. These errors
arise mainly from the age and quality of documents, where inter-word
and inter-line spacing, ink spread and background text interference
cause segmentation errors, which in turn affect overall recognition
accuracy. In segmentation-free approaches, sequential classifiers like
the Hidden Markov Model (HMM) and graphical models like Conditional
Random Fields (CRF) have been used. These algorithms introduced the
use of context information in terms of transition probabilities and
n-gram models, thus improving recognition accuracies [2]. However,
these approaches mostly do not work with unsegmented words, and those
that do are restricted because they use a dictionary with a limited
vocabulary.

Long Short Term Memory (LSTM) based recurrent neural network
architectures have been widely used for speech recognition, text
recognition, social signal prediction [6], emotion recognition and
time series prediction problems because of their ability to learn
sequences. LSTM has emerged as the most competent classifier for
handwriting and speech recognition. It performs considerably well on
handwritten text without explicit knowledge of the language and has
won several competitions. LSTM has been used for the recognition of
printed Urdu Nastaleeq script and printed English and Fraktur scripts.
RNN based approaches have been popularly used for Arabic scripts,
wherein segmentation is immensely difficult. LSTM based approaches
have outperformed HMM based ones for handwriting recognition,
indicating that learnt features are better than handcrafted features.
With the advent of deep learning algorithms, deep belief networks and
deep neural networks are gaining popularity due to their efficiency
over shallow models.

Fig. 1: Block diagram of Recognition Architecture

OCRs for Indic scripts are not as robust as those for Roman scripts,
since most of the algorithms used for Indic script recognition are
segmentation based and script dependent. Recognition of Indic scripts
is challenging for several reasons. The scripts are complex, giving
rise to a huge number of symbols (classes), including basic
characters, vowel modifiers and conjuncts formed from combinations of
two or more characters. If the text is noisy or the document is
degraded, recognition suffers badly due to segmentation faults at the
line and word level. Traditionally, handcrafted features have mostly
been used for text recognition of Oriya and Bangla, and classifiers
like HMM, SVM and CRF have been widely used. Naveen et al. presented a
direct implementation of a single layer LSTM network for the
recognition of Devanagari script and further experimented on more
Indic scripts.

III. Recognition Architecture

For document image recognition we need to segment the full image in
order to localize text blocks, lines and words. This segmentation of
lines and words from a page induces considerable error, depending on
the age and quality of the document, the scanning technique and
similar factors. But a completely segmentation-free approach is
difficult because learning very long range sequences can become
computationally intractable. Thus, at the pre-processing level we
incorporate several segmentation algorithms and use a soft decision
based multi-hypothesis architecture for choosing the best possible
recognized sequence. In this work we have used three standard
segmentation algorithms: Hough transform, geometric projection and
interval tree based segmentation. Candidate word segments from each
segmentation algorithm are passed through the Deep BLSTM recognizer,
which has been trained on perfect word sequences. The Deep BLSTM
network generates output sequences corresponding to the segments. In
the multi-hypotheses framework, the same line segments of a page,
derived from different segmentation schemes, sit in different
branches, and these sequences are then matched using a language model
to refine and pick the best sequence. A block diagram of the
multi-hypothesis architecture is shown in Figure 1.

The main motivation for this hypothesize-and-verify framework is the
following question: in a test case with erroneous segmentation, how do
we make the best use of the different segmentation algorithms and the
rules of the script to improve recognition? Words do not combine in a
random order; they necessarily follow the grammar of the particular
script. The order of words determines the grammar and can be learnt
from an ideal context model. Since character n-grams are better
primitives and have been widely used for retrieval purposes, we use
character n-grams instead of word n-grams. Creating an n-gram model of
words is difficult because it requires a dictionary of all possible
words of the script, which is not available for most Indic scripts.
The language model is learnt separately for the script to introduce
language statistics for rejecting invalid n-grams and picking the best
possible output sequence. Each primitive, be it a basic character,
vowel modifier or conjunct, follows a certain order to form a valid
word. Here we use trigrams and 4-grams and parse the sequence to find
the corresponding matches.
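
As a minimal sketch of how such a character n-gram model could be stored, the following Python fragment collects every trigram and 4-gram seen in a word list into a set. The function names (`char_ngrams`, `build_ngram_model`) and the toy corpus are illustrative assumptions, and the set-based representation is one simple choice rather than a description of the model used in our experiments. Python strings are indexed by Unicode code point, so each element of an n-gram is one code point.

```python
from typing import Iterable, List, Set


def char_ngrams(word: str, n: int) -> List[str]:
    """Return all contiguous length-n substrings of word, in order."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def build_ngram_model(corpus_words: Iterable[str], orders=(3, 4)) -> Set[str]:
    """Collect every character n-gram seen in the corpus for the given orders."""
    model: Set[str] = set()
    for word in corpus_words:
        for n in orders:
            model.update(char_ngrams(word, n))
    return model


# Hypothetical toy corpus; a real model would be learnt from a large text corpus.
corpus = ["recognition", "segmentation"]
model = build_ngram_model(corpus)
```

Membership tests such as `"tion" in model` then take constant time on average, which matters when every sub-string of every candidate word is checked.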

During the sub-string search, if there exists a substring which is not
present in the trigram and 4-gram model, the substring is considered
invalid. A cumulative matching score is defined, along with a penalty
for each mismatch with a trigram or 4-gram sequence. During faulty
line segmentation, the upper zone or lower zone primitives (mostly
vowel modifiers) merge with a neighboring line or go missing from the
original line, thus creating errors. Erroneous word segmentation can
also lead to broken characters or missing parts of words. Such words
have faulty Unicode sequences due to misplacement or addition of upper
or lower zone primitives and thus get misrecognized. These errors can
be handled by learning valid combinations of characters, and we select
the best possible sequence out of the different pathways of the
multi-hypothesis pipeline. These sequences are verified using language
statistics of the script to find the best possible word. This
verification on a multi-hypothesis framework using language models
eliminates segmentation errors and is the main contribution of this
work.
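
The validity check and cumulative score described above can be sketched as follows. The reward of +1 per matched n-gram and the penalty weight are assumed values chosen for illustration, as are the function names, and the toy model is built from a single English word rather than an Oriya corpus.

```python
def ngrams(text, n):
    """All contiguous length-n substrings of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def matching_score(sequence, model, orders=(3, 4), penalty=1.0):
    """Cumulative score: +1 for every n-gram found in the model,
    minus `penalty` for every n-gram that is missing (assumed weights)."""
    score = 0.0
    for n in orders:
        for gram in ngrams(sequence, n):
            score += 1.0 if gram in model else -penalty
    return score


def is_valid(sequence, model, orders=(3, 4)):
    """A sequence is valid only if all of its n-grams occur in the model."""
    return all(g in model for n in orders for g in ngrams(sequence, n))


def pick_best(candidates, model):
    """Return the candidate with the highest cumulative score."""
    return max(candidates, key=lambda s: matching_score(s, model))


# Toy language model built from one correctly spelled word (illustrative only).
model = {g for n in (3, 4) for g in ngrams("recognition", n)}
# Hypothetical recognizer outputs: one deletion, one correct, one insertion.
candidates = ["recogniton", "recognition", "recoognition"]
best = pick_best(candidates, model)
```

In this toy case the deletion and the insertion both create n-grams that are absent from the model, so the correctly recognized candidate receives the highest score.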

IV. Learning Framework

A. Recurrent Neural Networks 

Deep recurrent neural networks (RNNs) have emerged as highly competent
classifiers for text and speech recognition, and Long Short Term
Memory (LSTM) has been the most successful recurrent neural network
architecture. Bidirectional Long Short Term Memory (BLSTM) can capture
long range context and has successfully overcome limitations of
standard RNNs such as the vanishing gradient and the need for
pre-segmented data. LSTM uses multiplicative gates to trap the error
so that a constant error flow is maintained. This mechanism is called
the Constant Error Carousel and helps overcome the vanishing gradient
problem. Bidirectional LSTM enables access to longer range context in
both directions using forward and backward layers [22]. Graves et al.
proposed a training method known as Connectionist Temporal
Classification (CTC) that can align sequential data and thus avoids
the need for pre-segmented data. LSTM has emerged as a very successful
architecture and is widely used as a robust OCR architecture for
printed and handwritten text [24]. Deep networks have outperformed
single layer LSTM for speech recognition [25], motivating the use of
Deep LSTM architectures for text recognition.
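
To make the role of the multiplicative gates concrete, the following sketch implements one time step of a single-unit LSTM cell in the commonly used formulation without peephole connections (an assumption; the variant implemented in a particular library may differ). The weights are hypothetical and chosen so that the forget gate stays open and the input gate stays closed, in which case the cell state is carried almost unchanged across time steps.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, w):
    """One time step of a single-unit LSTM cell with scalar input and state.

    `w` maps each of 'i', 'f', 'o', 'g' to a tuple
    (input weight, recurrent weight, bias).
    """
    def pre(key):
        wx, wh, b = w[key]
        return wx * x + wh * h_prev + b

    i = sigmoid(pre("i"))    # input gate
    f = sigmoid(pre("f"))    # forget gate
    o = sigmoid(pre("o"))    # output gate
    g = math.tanh(pre("g"))  # candidate cell input
    c = f * c_prev + i * g   # additive cell update (the error carousel)
    h = o * math.tanh(c)     # gated cell output
    return h, c


# Hypothetical weights: forget gate saturated open, input gate saturated shut.
weights = {
    "i": (0.0, 0.0, -20.0),
    "f": (0.0, 0.0, 20.0),
    "o": (0.0, 0.0, 0.0),
    "g": (1.0, 0.0, 0.0),
}
h, c = 0.0, 0.5
for _ in range(50):
    h, c = lstm_step(1.0, h, c, weights)
```

Because the cell update is additive and gated, the stored value is neither squashed nor repeatedly multiplied by a small weight, which is the property that keeps the back-propagated error from vanishing.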

B. Deep BLSTM 

In deep feedforward neural networks, depth refers to having multiple
non-linear layers between the input and output layers. But in the case
of LSTM, which is a recurrent neural network, the same principle
cannot be applied directly due to the temporal structure of RNNs. We
construct a deep BLSTM architecture by stacking multiple hidden layers
to increase the representational capability for higher order features.
RNNs add temporal context to the learning, and LSTM's internal cell
architecture with the forget gate preserves the state over time. The
implementation of deep LSTM with N layers is as follows. This
architecture has three bidirectional LSTM (BLSTM) layers used as the
three hidden layers (N = 3), stacked between the input layer (layer 0)
and the output layer (layer N+1).
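
A common way to write this stacking, given here as an assumed form that is consistent with the notation described next (with the input sequence taken as $h_t^0 = x_t$ and $y_t$ the output layer activation), is:

```latex
\begin{align}
h_t^{n} &= \mathcal{H}\!\left( W_{h^{n-1}h^{n}}\, h_t^{n-1} + W_{h^{n}h^{n}}\, h_{t-1}^{n} + b_h^{n} \right), \qquad n = 1, \dots, N, \\
y_t &= W_{h^{N}y}\, h_t^{N} + b_y,
\end{align}
```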
where all superscripts indicate the index of the layer and the
subscript t denotes the time frame. W is a weight matrix, b is the
bias, $h_t^n$ is the hidden layer activation of each memory cell of
the nth layer at time t (n = 1, ..., N), and $\mathcal{H}$ denotes the
activation function of the LSTM. Bidirectional LSTM is used so that
previous and future context with respect to the current position can
be exploited for sequence learning, with the forward and backward
directions handled by two separate layers. To create a deep BLSTM
network, the interlayer connections are made such that the output of
each hidden layer (consisting of a forward and a backward LSTM layer)
propagates to both the forward and backward LSTM layers forming the
successive hidden layer. The stacking of hidden layers helps obtain
higher level feature abstraction. We have used 36K words for training
and 10K for testing. To speed up training we harness the power of
multicore CPUs by redesigning LSTM as a threaded implementation using
OpenMP and BLAS routines.



Fig. 2: Block Diagram of Deep BLSTM architecture

C. Network Parameters 

The neural network uses a CTC output layer with 162 units (161 basic
class labels and one for blank). The network is trained with three
hidden bidirectional LSTM layers separated by feedforward units with
tanh activation, and the CTC output layer has a softmax activation
function. Several experiments have been performed by varying the
number of hidden units in each hidden layer. The network is trained
with a fixed learning rate of 10“^, momentum 0.9, and initial weights
selected randomly from [-0.1, 0.1]. The total number of weights in the
network is 154135. Bias weights of the read, write and forget gates
are initialized with 1.0, 2.0 and -1.0, respectively. The output unit
squashing function is a sigmoid function. CTC error has been used as
the loss function for early stopping since it tends to converge the
fastest, so training time decreases as the number of epochs decreases.
For the BLSTM network we use RNNLIB, a recurrent neural network
library.
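
Once trained, the per-frame outputs of a CTC layer must be turned into a label sequence. A minimal sketch of best-path (greedy) decoding is given below, assuming the blank is the last of the 162 units (index 161). Both the decoding strategy and the index convention are assumptions made for illustration, and the toy example uses only four units.

```python
BLANK = 161  # assumed index of the blank unit; class labels occupy 0..160


def ctc_best_path(frame_outputs, blank=BLANK):
    """Greedy CTC decoding: take the argmax of every frame, merge
    consecutive repeats, then drop blanks."""
    decoded = []
    previous = None
    for probs in frame_outputs:
        label = max(range(len(probs)), key=probs.__getitem__)
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded


def one_hot(index, size):
    """Helper for the toy example: a frame that puts all mass on one unit."""
    return [1.0 if k == index else 0.0 for k in range(size)]


# Toy network with 3 labels plus a blank at index 3 (illustrative only).
toy_path = [0, 0, 3, 0, 1, 1, 3]
frames = [one_hot(k, 4) for k in toy_path]
decoded = ctc_best_path(frames, blank=3)
```

The blank between the second and third frames is what allows the label 0 to be emitted twice in a row, which is why the blank unit is needed in addition to the class labels.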

V. Dataset 

Indic scripts have a huge number of classes due to the presence of
basic characters, vowel modifiers and conjuncts. These conjuncts and
vowel modifiers are composed of more than one Unicode code point, so
learning the alignment of code points becomes important. This would
normally necessitate Unicode re-ordering or post-processing schemes,
but LSTM with a CTC output layer is able to learn the sequence
alignment. Recognition of Oriya characters is very challenging due to
the large number of classes and the highly similar shapes of basic
characters. Pages are scanned from several books with different fonts
at 300 dpi resolution and are binarized using Sauvola binarization.
The pages do not have any skew but are heavily degraded as the books
are very old. The foreground text has significant interference from
the background text due to thin pages. Raw binarized image pixels are
used as input features by the network.
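
As a small illustration of why one visual unit corresponds to several code points, the following sketch lists the code points of two Oriya clusters using Python's standard `unicodedata` module. The two example clusters and the helper name `describe` are chosen here for illustration only.

```python
import unicodedata

# Logical (stored) order puts the consonant first and the dependent vowel
# sign second, even though the sign for E is drawn to the left of the consonant.
syllable_ke = "\u0B15\u0B47"          # KA + VOWEL SIGN E
conjunct_kssa = "\u0B15\u0B4D\u0B37"  # KA + VIRAMA + SSA, rendered as one conjunct


def describe(text: str):
    """Return (code point, Unicode name) pairs for every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


for label, text in (("ke", syllable_ke), ("kssa", conjunct_kssa)):
    print(label, describe(text))
```

The recognizer therefore has to emit two or three labels, in logical order, for what appears in the image as a single shape whose parts may be arranged in a different visual order.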

VI. Results 

For end-to-end recognition, different segmentation algorithms were
used. In a traditional OCR workflow, the recognition accuracy suffers
due to segmentation errors at the line, word or character level. The
proposed framework gives us the freedom to choose from alternative
segmentation hypotheses. The segmentation algorithms used as
alternative hypotheses are complementary in nature. As shown in
Table I, interval tree based segmentation (IT) individually performs
worst in comparison with geometric profiling and Hough transform. But
by using all three as different branches for alternative segmentation,
we observe better results in terms of both character and word
recognition accuracy. In this paper we do not aim to find the best
segmentation algorithm; rather, we intend to use different
segmentation pathways in order to improve recognition. Most errors
arose from line segmentation, although we observed some merged and
broken words from the word segmentation routines. In the case of
interval tree based segmentation, the upper and lower zone characters
got separated from their line and appeared as a separate line, thus
increasing the number of lines. In the case of Hough transform, we use
certain heuristics to restrain the line height to an average line
height calculated over the training pages. Geometric profiling based
methods worked better than the other algorithms considered for line
segmentation, but they rely heavily on heuristics and spatial
constraints. All these errors make it difficult to compare words
across hypotheses, since the number of lines and words produced by
each hypothesis is different. To solve this problem, we use a
neighborhood search while traversing a sub-string in search of valid
n-grams. If there is a mismatch between the pathways, the problem
mostly cascades into successive words and generates more errors. By
performing a search over the preceding and succeeding words, we have
been able to resolve such errors. The benefit of context search is
illustrated in the first row of Figure 3. In this case two words were
segmented as a single word by IT and Hough transform but as two
different words by profiling. Due to the use of context search we
could find the corresponding word in the next node, and thus the
recognition is correct. We observed that the sequences were mostly
picked from geometric profiling, but for words without upper or lower
zone characters, interval tree based segmentation complemented the
other hypotheses and resulted in correct sequences. In the second row
of Figure 3, the word image is not discernible and is misrecognized
because a similar modifier with the exact shape exists. This was
nevertheless recognized correctly since IT performed better on such
middle zone characters. If a sub-string does not match an n-gram, an
error penalty is imposed and matching continues for each word across
all pathways. At each node the word with the least error is picked as
the best word.
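
One possible realization of the neighborhood search is sketched below. It assumes that the correspondence between pathways is established by string similarity (computed with `difflib.SequenceMatcher` from the Python standard library) within a window of one preceding and one succeeding word; this similarity criterion, the function names and the toy English data are assumptions made only for illustration.

```python
from difflib import SequenceMatcher


def ngram_error(word, model, orders=(3, 4)):
    """Count the n-grams of `word` that are absent from the language model."""
    return sum(
        1
        for n in orders
        for i in range(len(word) - n + 1)
        if word[i:i + n] not in model
    )


def align_in_neighborhood(target, words, node, radius=1):
    """Within `radius` positions of `node`, return the word most similar to
    `target`, or None if the window is empty."""
    window = words[max(0, node - radius):node + radius + 1]
    if not window:
        return None
    return max(window, key=lambda w: SequenceMatcher(None, target, w).ratio())


def best_word(node, reference, others, model, radius=1):
    """Pick the least-error word among the reference word at `node` and its
    aligned counterparts in the other pathways."""
    candidates = [reference[node]]
    for words in others:
        match = align_in_neighborhood(reference[node], words, node, radius)
        if match is not None:
            candidates.append(match)
    return min(candidates, key=lambda w: ngram_error(w, model))


# Toy language model and two hypothetical pathways (illustrative only).
vocabulary = ["deep", "network", "model"]
model = {w[i:i + n] for w in vocabulary for n in (3, 4)
         for i in range(len(w) - n + 1)}
pathway_a = ["deep", "netwrk", "model"]        # one recognition error
pathway_b = ["x", "deep", "network", "model"]  # spurious segment shifts words
result = [best_word(k, pathway_a, [pathway_b], model)
          for k in range(len(pathway_a))]
```

Although the spurious first segment shifts every word of the second pathway by one position, the window around each node still contains the corresponding word, so the misalignment does not cascade.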

Fig. 3: Segments from individual segmentation algorithms and results
from the proposed framework

Because alternative hypotheses are used to find the best word, this
framework is able to handle insertions and deletions, which mainly
arise from the recognizer. However, when a substring is valid
according to the n-grams but one or more primitives have been
substituted, the framework is unable to detect the error. As we are
working with full unsegmented words, the presence of a valid n-gram
does not necessarily guarantee correct recognition, as there might
exist a similar n-gram with some substitution which is also valid.
Figure 4 shows parts of page images with large line segmentation
errors (highlighted in red boxes). These occur due to the presence of
lower zone modifiers in the upper line and upper zone modifiers in the
lower line, which decreases the interline gap. In such cases the
alignment of words also gets distorted, because the number of lines
and words differs between hypotheses. Our framework consistently
resolves such issues through the use of a neighborhood during
best-first search.

We test the pages using the proposed framework and calculate the
character and word recognition errors, which are given in Table I.
Owing to the multiple hypotheses and the verification framework, we
obtain a word error rate of 10.64%, lower than that of any individual
segmentation algorithm.
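
For reference, error rates of this kind are commonly computed as the edit distance between recognized and reference sequences, normalized by the reference length. The sketch below follows that convention, which is assumed here rather than taken from our evaluation scripts; the function names are illustrative. Passing strings gives a character level error and passing lists of words gives a word level error.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences
    (insertions, deletions and substitutions all cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(refs, hyps):
    """Total edit distance divided by total reference length, in percent."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    length = sum(len(r) for r in refs)
    return 100.0 * errors / length


# Character level: one substitution in four characters.
char_error = error_rate(["abcd"], ["abxd"])
# Word level: one wrong word out of two.
word_error = error_rate([["deep", "network"]], [["deep", "netwrk"]])
```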

Fig. 4: Parts of pages where lines get merged due to lower zone
characters of the upper line and upper zone characters of the lower
line


TABLE I: Test Results

Method                             Label Error (%)   Word Error Rate (%)
Geometric profiling                14.10             16.301
Interval tree based Segmentation   30.22             35.06
Hough Transform                    22.24             28.49
Proposed Framework                 8.64              10.64


VII. Conclusion 

This paper proposes a text recognition framework which uses multiple
segmentation algorithms as different hypothesis generators, recognizes
each segment using a deep BLSTM network and verifies the output of the
deep neural network with a learned language model. In this work we
segment words from a page using different segmentation routines, and
the best word is selected using best-first search over a spatial
neighborhood to avoid alignment issues. The proposed framework
obtained a high word recognition rate due to the use of alternative
segmentations and verification using n-grams, which helped filter
recognition errors. This framework is highly suitable for degraded
documents, wherein segmentation algorithms are the main cause of
error. It is very effective against insertion and deletion errors
introduced by the recognizer. If the segmentation algorithms
considered are complementary, the recognition error of the hybrid can
be expected to be much lower than that of the best individual
segmentation algorithm. Deep BLSTM helps in recognizing sequences of
words and also learns the alignment of Unicode code points, which is a
challenge in Indic scripts. This work could be extended to recognize
and verify longer text sequences.